
Data in Brief

Elsevier BV

Preprints posted in the last 90 days, ranked by how well they match Data in Brief's content profile, based on 13 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Tracing mobility among Eneolithic-Bronze Age Kurgan populations in the North Pontic steppe

Nikitin, A. G.; Renson, V.; Ivanova, S.; Neff, N. C.; Straioto, H.; Svyryd, S.

2026-03-24 evolutionary biology 10.64898/2026.03.21.713323 medRxiv
Top 0.1%
2.6%

Five millennia ago, nomadic people from the North Pontic steppe left a profound impact on the course of Eurasian prehistory. However, little is known about their mobility patterns within their home region. To address this knowledge gap, we conducted a survey of the strontium isotope landscape of people interred in the 4th-3rd millennium BCE burial mounds (kurgans) of the western part of the North Pontic steppe. By analyzing the strontium signature in human bone and dentin, we established strontium baseline values for the region. We subsequently correlated enamel strontium ratios from 25 selected individuals with the baseline obtained and with published strontium data across the North Pontic steppe. Enamel strontium ratios show that some individuals interred in the northwest North Pontic fall within the regional baseline range, whereas others overlap with values reported for the eastern North Pontic steppe. In conjunction with carbon (δ13C) and nitrogen (δ15N) stable isotope data, we further determined that some individuals interred in the western Pontic steppe either spent the later part of life in the west Caspian steppe or were affected by physiological stress during lifetime. By integrating our data with published isotopic datasets, we produced a first baseline heatmap of the North Pontic steppe for the c. 4000-2000 BCE chronological period.

2
Development and fit for purpose validation of a quantitative LC-MS/MS method for heparan sulfate in cerebrospinal fluid as a biomarker for mucopolysaccharidosis type IIIA

Bystrom, C.; Douglass, K.; Gupta, M.

2026-03-30 genetic and genomic medicine 10.64898/2026.03.27.26348847 medRxiv
Top 0.1%
2.1%

Background: Mucopolysaccharidosis type IIIA (MPS IIIA; Sanfilippo syndrome) is a fatal neurodegenerative lysosomal storage disorder caused by impaired degradation of heparan sulfate (HS). Despite rapid advances in gene and enzyme therapies, there remains a critical need for an analytically validated, quantitative biomarker that accurately reflects central nervous system (CNS) substrate burden. Such a biomarker would be a valuable tool in assessing disease progression and monitoring therapeutic efficacy. Objective: This study describes the method development, fit-for-purpose validation, and preliminary clinical application of a quantitative liquid chromatography-tandem mass spectrometry (LC-MS/MS) assay for the HS-derived disaccharide N-sulfoglucosamine-glucuronic acid (GlcNS-GlcUA) in human cerebrospinal fluid (CSF), a critical biomarker for diagnosis, disease monitoring, and regulatory evaluation of emerging MPS IIIA therapies. Methods: A structurally defined GlcNS-GlcUA reference standard and its [13C6]-labeled internal standard were used in a derivatization and detection workflow employing 1-phenyl-3-methyl-5-pyrazolone labeling and LC-MS/MS. Results: The method exhibited acceptable linearity across 0.005-0.500 nmol/mL (r ≥ 0.9976), with intra- and inter-assay imprecision ≤ 3.5% CV and accuracy within 95%-110% of nominal concentrations. No matrix or hemolysis interference or carryover was observed, and the analyte remained stable during freeze-thaw storage conditions. Application of the method to 12 CSF samples from patients with MPS IIIA demonstrated quantifiable GlcNS-GlcUA levels ranging from 0.0054 to 0.106 nmol/mL, confirming suitability for clinical and regulatory use. Comparison of the MPS IIIA sample results between the development laboratory and the contract research organization laboratory supports robust inter-lab assay transfer. Conclusions: This validated LC-MS/MS method establishes a regulatory-grade quantitative assay for measurement of CSF HS in MPS IIIA. Its high analytical sensitivity and reproducibility enable reliable assessment of CNS substrate reduction and pharmacodynamic response, supporting biomarker-driven therapeutic development and accelerated approval pathways for neuronopathic mucopolysaccharidoses.

3
Phosphoproteomics in Daphnia magna as a tool to decipher molecular mechanisms in ecotoxicological studies

Wilde, M. V.; Stöckl, J. B.; Kösters, M.; Rupprecht, M. M.; Brehm, J.; Schwarzer, M.; Otte, K. A.; Laforsch, C.; Fröhlich, T.

2026-05-05 pharmacology and toxicology 10.64898/2026.05.01.721871 medRxiv
Top 0.1%
1.9%

Pollution of aquatic environments poses an increasingly severe threat to ecosystems worldwide, and understanding its molecular consequences for aquatic organisms requires extensive research and the development of advanced analytical tools. Phosphoproteomics can be particularly valuable for this purpose, as shifts in phosphorylation states can serve as early molecular indicators of toxic exposure. The cladoceran Daphnia is a keystone species in aquatic ecosystems, linking lower and higher trophic levels, and is therefore widely used as a model organism in ecotoxicology to study biological consequences of pollution. Here, we present a simple and effective strategy to analyse the phosphoproteome of Daphnia magna, a commonly used Daphnia species in ecotoxicology. Following TiO2-based phosphopeptide enrichment and LC-MS/MS analysis, we identified a comprehensive dataset of 3,532 phosphorylation sites across 1,329 phosphoproteins. These proteins were especially involved in signaling pathways and cellular structure, and the vast majority have not yet been demonstrated in other Daphnia species. In conclusion, our results demonstrate that a straightforward phosphoproteomic LC-MS/MS workflow in D. magna can serve as a powerful tool for investigating adverse molecular effects caused by anthropogenic pollution, such as microplastics or pharmaceuticals. Statement of significance: The dataset presented here demonstrates the feasibility of a simple yet effective strategy to perform phosphoproteomics in Daphnia magna, and it will be particularly valuable for future ecotoxicoproteomics research using this model organism.

4
Protocol for measuring endocrine disruptive effects on transcriptional bursting using single-molecule imaging in human breast cancer cells

Yasar, P.; Day, C. R.; Rodriguez, J.

2026-05-05 cell biology 10.64898/2026.05.01.722245 medRxiv
Top 0.1%
1.7%

Transcriptional bursts regulate gene expression by altering burst size or burst frequency. Here, we present a protocol that integrates fixed-cell smFISH and live-cell single-molecule imaging to analyze estrogen-responsive transcriptional bursting of the TFF1 gene in human breast cancer cell lines. This workflow enables measurement of burst size, burst initiation, and active allele frequency to determine how endocrine disruptor chemicals modulate transcriptional bursting dynamics. For complete details on the use and execution of this protocol, please refer to Day, Yasar et al.1

5
Archaeological preservation of amelogenesis pathways

Asmundsdottir, R. D.; Troche, G.; Olsen, J. V.; Martinez de Pinillos, M.; Martinon-Torres, M.; Schrader, S.; Welker, F.

2026-03-26 evolutionary biology 10.64898/2026.03.25.713862 medRxiv
Top 0.1%
1.7%

Dental enamel, the hardest mineralised tissue in the human body, has proven to be an excellent source of ancient proteins, which have been found to survive within dental enamel for at least twenty million years. In archaeological and palaeontological contexts, the enamel proteome is generally considered to be rather small, consisting of about twelve proteins, most of which are unique to enamel. During amelogenesis these proteins undergo in vivo digestion by matrix metalloproteinase 20 (MMP20) and kallikrein 4 (KLK4) as well as serine phosphorylation by family with sequence similarity member 20-C (FAM20C), processes that alter their characteristics. Gaining knowledge of the previously understudied influence of amelogenesis on the archaeological human dental enamel proteome could benefit various palaeoproteomic analyses, especially in a human evolutionary context. Here we present archaeological dental enamel proteomes and explore protein cleavage patterns and sequence coverage to estimate the effects of in vivo digestion, as well as explore phosphorylation patterns. Additionally, we present a new marker based on phosphorylation to estimate genetic sex.

6
Performance of Road-Traffic-Based Exposure Proxies Against Personal PM2.5 Measurements in Three Sub-Saharan African Countries

Nyoni, H. B.; Mushore, T. D.; Munthali, L.; Makhanya, S. A.; Chikoko, L.; Luchters, S.; Chersich, M. F.; Machingura, F.; Makacha, L.; Barratt, B.; Mistry, H. D.; Volvert, M.-L.; von Dadelszen, P.; Roca, A.; D'alessandro, U.; Temmerman, M.; Sevene, E.; Govindasamy, T. R.; Makanga, P. T.; The PRECISE Network; The HE²AT Centre

2026-03-17 public and global health 10.64898/2026.03.13.26348337 medRxiv
Top 0.1%
1.7%

Introduction: Particulate matter (PM2.5) exposure contributes to the global disease burden, yet its monitoring remains sparse and uneven, particularly in settings with limited ground monitoring networks. Road-traffic proxy indicators can provide indirect estimates of PM2.5 where measurements are limited but require context-specific validation. We evaluated three road-traffic-related PM2.5 proxies: (i) population-Weighted Road Network Density (WRND), (ii) Euclidean (straight-line) distance from highways (EH), and (iii) Euclidean distance from main roads (EM). Methods: We validated proxies using high-resolution outdoor filtered PM2.5 personal exposure measurements collected over 1 year from 343 postpartum participants in The Gambia, Kenya, and Mozambique. Village-level spatial patterns for the PM2.5-proxy relationship were mapped using 5 km hexagonal aggregated tessellations. Proxy-PM2.5 associations were assessed using Spearman correlation, and predictive utility was tested using country-specific and global Random Forest (RF) models (3-fold cross-validation), reporting R2, RMSE, and feature importance. Results: Spatial mapping showed heterogeneous proxy-PM2.5 relationships across and within sites, with elevated PM2.5 occurring in both low- and high-proxy contexts. WRND-PM2.5 correlations were weak overall and statistically significant only in Mozambique (r = 0.351; p = 0.005), with non-significant associations in Kenya (r = -0.041; p = 0.673) and The Gambia (r = -0.020; p = 0.909). EH-PM2.5 correlations were positive in The Gambia (r = 0.335; p = 0.053) and Mozambique (r = 0.292; p = 0.020) but negative and significant in Kenya (r = -0.224; p = 0.018). Single-variable RF models performed poorly in all countries (R2 < 0.45) and in the global model (R2 = 0.42). Combining proxies improved performance in Kenya (R2 = 0.52; RMSE = 31.7 µg/m3), in Mozambique (R2 = 0.60; RMSE = 8.9 µg/m3), and globally (R2 = 0.46; RMSE = 29.1 µg/m3), although in The Gambia the combined model (R2 = 0.53; RMSE = 37.6 µg/m3) did not exceed the best single-proxy model. Conclusion: Road-network proxies provide context-dependent signals of personal PM2.5 exposure, and predictive performance is strengthened when proxies are combined in a hybrid model.

7
A general methodology for liver sinusoid fenestration analysis based on 3D electron microscopy data

Pohar, C.; Rekik, Y.; Phan, M. S.; Gallet, B.; Desroches-Castane, A.; Chevallet, M.; Tinevez, J.-Y.; Tillet, E.; Vigano, N.; Jouneau, P.-H.; Deniaud, A.

2026-03-09 cell biology 10.64898/2026.03.07.710307 medRxiv
Top 0.1%
1.7%

The liver has a complex architecture composed of millions of lobules. Within these lobules, hepatocytes, the main hepatic cells, are organized in rows separated by blood capillaries known as sinusoids. These capillaries are lined by liver sinusoidal endothelial cells (LSEC) that form a very specific fenestrated endothelium essential for the exchange of metabolites and proteins between the blood and hepatocytes. Alterations in the size and number of LSEC fenestrations are associated with the onset and the progression of various liver diseases. The analysis of liver architecture is thus of utmost importance for advancing our knowledge of liver ultrastructure and its alterations. Liver architecture has been studied for decades, mainly using 2D electron microscopy and, more recently, advanced super-resolution fluorescence microscopy. In recent years, volume electron microscopy techniques, including focused ion beam-scanning electron microscopy (FIB-SEM), have progressed and now enable the 3D reconstruction of biological ultrastructures down to nanometer resolution. However, the analysis of large volumes (e.g., several tens of µm3) remains challenging due to various constraints in the segmentation of large datasets. In the current study, we developed a workflow to semi-automatically segment hepatic sinusoids from FIB-SEM mouse liver datasets using the CNN-based (convolutional neural network) tool known as "nnU-Net", after fine-tuning a ground truth model. We also implemented tools for semi-automatic quantification of LSEC fenestrae diameters and sinusoid porosity from segmented datasets. This workflow enabled us to compare the distribution of LSEC fenestrae diameters in wild-type versus Bmp9-deleted mice; BMP9 is a hepatic factor known to be involved in fenestration maintenance. Our results confirm the importance of BMP9 for LSEC differentiation. Therefore, the developed methodology represents a valuable tool for characterizing the fenestrated endothelium under various physiological and pathological conditions.

8
Short-term Air Pollution Exposure and Risk of Airway Inflammatory Response in Children (CHERISH): Protocol for a Randomised Mixed Factorial Study

Moloney, S.; Hajmohammadi, H.; Wood, H. E.; Mead, M. I.; Mudway, I. S.; Mosler, G.; Thomson, A. C.; Gonzalez Calvo, I.; Scales, J.; Whitehouse, A.

2026-05-28 public and global health 10.64898/2026.05.28.26353607 medRxiv
Top 0.1%
1.7%

Introduction: Air pollution is the largest environmental risk to human health. Children are disproportionately affected by air pollution and their exposure is amplified during physical activity. Observed concentrations of nitrogen dioxide in 1 in 4 London school playgrounds exceed the European limit, but the health impacts of air pollution exposure in London school playgrounds remain unexplored. Our study aims to assess and compare the acute changes in lung function and airway inflammation of primary school-aged children exercising in school playgrounds. Methods and analysis: 330 children aged 8 to 11 years from ten London schools will be recruited to complete 90 minutes of physical activity and 90 minutes of rest in their school playground in a randomised crossover design. Pre-, post-, and 24-hour post-exposure oscillometry measurements will be performed, with airway resistance at 5 Hz (R5) as the primary physiological outcome. Nasal lavage samples will be collected pre-exposure and 24-hour post-exposure for analysis of inflammatory, oxidative, and vascular biomarkers, with IL-6 as the primary biological outcome. Mixed-effects regression models will examine associations between estimated pollutant exposures, exercise and physiological responses.

9
Dried blood spot proteomics as a diagnostic framework for citrin deficiency

Totsune, E.; Nakajima, D.; Konno, R.; Mikami-Saito, Y.; Arai-Ichinoi, N.; Nishida, H.; Yagi, H.; Ishige, T.; Suzuki, H.; Shirota, M.; Takayama, J.; Takano-Asai, C.; Shimura, M.; Sasai, H.; Lee, T.; Kido, J.; Nakajima, Y.; Kobayashi, H.; Kikuchi, A.; Numakura, C.; Hamazaki, T.; Oishi, K.; Nakamura, K.; Kawashima, Y.; Ohara, O.; Wada, Y.

2026-05-28 genetic and genomic medicine 10.64898/2026.05.26.26354012 medRxiv
Top 0.1%
1.5%

Background: Citrin deficiency, caused by biallelic pathogenic variants in SLC25A13, must be identified early to prevent serious complications such as hyperammonemia and liver failure. However, clinical diagnosis is often delayed due to its nonspecific presentation and the limited sensitivity of amino acid-based newborn screening methods. Although genome-based evaluations are being investigated to address these issues, concerns about their cost, turnaround time, variant interpretation ability, and data handling highlight the need for a more practical yet reliable alternative. We investigated the feasibility of applying a proteomic approach to dried blood spots (DBS), which are routinely used in newborn screening. Methods: We performed untargeted liquid chromatography-tandem mass spectrometry to analyze the proteome of DBS using a previously developed "non-targeted analysis of non-specifically DBS-absorbed proteins" (NANDA) workflow. SLC25A13 protein abundance was quantified in individuals with biallelic loss-of-function mutations, compound loss-of-function/missense mutations, and heterozygous carriers; this was also evaluated in healthy and diseased controls representing relevant differential diagnoses. To leverage proteomic information, we derived a multivariate proteomic signature using feature selection and evaluated its performance with leave-one-out cross-validation. Biological relevance was assessed by enrichment analysis, and complementary transcriptomics was performed using RNA sequencing. Results: A total of 7,474 proteins, including SLC25A13, were consistently detected in DBS. SLC25A13 was undetectable in individuals with biallelic loss-of-function mutations. However, individuals with compound loss-of-function/missense genotypes showed reduced but measurable SLC25A13 levels, comparable to those observed in heterozygous carriers. In contrast, a compact 15-protein signature accurately identified individuals with compound loss-of-function/missense genotypes (AUC, 0.99; sensitivity, 1.00; specificity, 0.95). The signature was enriched for Ca2+-response, and transcriptomics showed downregulation of genes related to multimodal ion channels in affected individuals compared to controls. Conclusions: DBS-based proteomic profiling may assist in the diagnosis of citrin deficiency through SLC25A13 quantification and a biologically plausible multivariate signature. More broadly, this strategy offers a promising new diagnostic layer for protein disorders, providing a proteomic readout in a clinically practical DBS format with potential utility for future diagnostic and screening applications.

10
Beyond dairy: Identification of dental enamel proteins in ancient human dental calculus

Leite, A.; Welker, F.; Godinho, R. M.; Gillis, R. E.; Islas, V. V.; Fagernas, Z.

2026-03-24 evolutionary biology 10.64898/2026.03.21.713223 medRxiv
Top 0.1%
1.5%

Ancient human dental calculus is one of the richest archives of archaeological biomolecular information, providing direct evidence of diet, oral health, and the oral microbiome. Proteomic analyses of this biological matrix have so far focused mainly on oral microbes and dietary proteins, with milk proteins such as beta-lactoglobulin (BLG) providing the largest corpus of proteomic evidence. Despite the close relation between the various stages of dental calculus formation and mineralization with the dental enamel surface, proteins from the dental enamel matrix have not previously been reported outside of dental enamel tissue. Here we reanalysed 498 ancient dental calculus proteomes from 14 published studies (n=434 individuals) reporting the presence of BLG, spanning from the Neolithic to the Victorian Era and applying different protein extraction protocols (FASP, GASP, SP3 and in-solution digestion). Dental enamel matrix proteins were identified in ten studies (n=37 individuals), with amelogenin being the most frequently detected. Enamel peptides occurred more often in studies that applied SP3, although amelogenin was successfully identified through both SP3 and FASP. Structural proteins, including enamelin, ameloblastin, and MMP20, were also identified. The detection of AMELX and AMELY peptide sequences provided new insights into cases where the sex was previously undetermined. These findings establish dental enamel proteins as a new category of biomolecules detected in dental calculus, broadening its application beyond diet and microbiome studies to possible sex estimation. 
Highlights:
- Dental calculus entraps oral microbes along with endogenous and exogenous particles during formation and mineralization
- We conduct reanalysis of 14 published ancient dental calculus studies (n = 434 individuals) spanning the Neolithic to Victorian Era
- Dental enamel proteins AMELX, AMELY, AMBN, COL17A1, ENAM and MMP20 are identified in ancient human dental calculus
- Amelogenin was the most frequently detected enamel protein
- We expand dental calculus palaeoproteomics beyond diet and oral microbiome to potentially include sex estimation

11
Satellite imagery encodes features predictive of regional mortality and life expectancy

Mitsuyama, Y.; Saito, K.; Kurimoto, S.; Walston, S. L.; Takita, H.; Ueda, D.

2026-05-19 public and global health 10.64898/2026.05.17.26353439 medRxiv
Top 0.1%
1.5%

Background: Increasingly accessible satellite imagery provides scalable measures of the built and natural environment relevant to population health. However, whether such imagery can capture subnational variation in mortality and life expectancy remains unclear. We therefore assessed its predictive value for regional mortality and life expectancy across OECD regions. Methods: We conducted an ecological, cross-sectional prediction study using 2023 data from OECD Territorial Level 3 (TL3) regions. Annual cloud-masked composites from the Harmonized Landsat and Sentinel-2 collection were processed in the Google Earth Engine, tiled at 224 x 224 pixels, and encoded with the pretrained Prithvi foundation model to derive region-level satellite embeddings. For each outcome, we trained LightGBM regressors for a country-only baseline, a satellite-only model, a combined model (country + satellite), and a final contextual model that additionally included prespecified socioeconomic and environmental covariates. Performance was evaluated using 10-fold outer cross-validation with held-out test folds; R2 was the primary metric. Results: The analytic sample comprised 2,414 OECD TL3 regions across 38 countries, for which 939,959 satellite image tiles were processed. In paired bootstrap comparisons, adding satellite features to country indicators improved predictive performance for all outcomes, with incremental R2 ranging from 0.097 to 0.233. The final contextual model achieved R2 values of 0.78 (95% CI, 0.74-0.81) for crude mortality, 0.87 (0.84-0.89) for age-adjusted mortality, 0.86 (0.82-0.88) for infant mortality, and 0.76 (0.69-0.84) for life expectancy. In SHAP analyses, the aggregated satellite image effect consistently ranked among the top predictors across outcomes. Conclusion: Satellite imagery captures subnational environmental heterogeneity relevant to regional mortality and life expectancy beyond country identity alone. Earth observation may therefore provide a scalable, complementary data source for characterizing geographic disparities in population health.

12
PIE Toolbox: SSM-PCA Based Software for PET Diagnostic Pattern Analysis

Romanov, M.; Kireev, M.; Didur, M.; Cherednichenko, D.; Korotkov, A.; Valdes-Sosa, P.; Fan, Q.; Wang, Q.

2026-06-01 radiology and imaging 10.64898/2026.05.28.26354341 medRxiv
Top 0.1%
1.5%

One of the prominent methods in neuroimaging data processing is SSM-PCA, which is based on principal component analysis and allows for the identification of diagnostically significant patterns in the form of statistical maps. We developed software, PIE Toolbox, that employs SSM-PCA and classification based on the diagnostic patterns derived from functional and structural tomographic brain imaging. The program supports the entire analysis pipeline, including preprocessing of brain images, extraction of diagnostic patterns, building of classification models, and prediction based on them. The resulting diagnostic patterns are weighted principal components obtained through SSM-PCA, or their linear combinations. PIE Toolbox allows selection of relevant structural and functional brain patterns, computation of their expression values in regions of interest, classification using support vector machines, and evaluation of model performance via cross-validation. This approach enables the use of patterns as features of intergroup differences for individual diagnosis. The software has been validated on both simulated and ADNI datasets.

13
Recovery of genomic and transcriptomic profiles from decades-old FFPE brain tissues

Robinson Christiansen, C.; Hansen Firoozfard, E.; Oskolkov, N.; Gilbert, M. P. T.; Mak, S. S. T.; Wirendfeldt, M.; Kjaer, C.; Marmol-Sanchez, E.

2026-04-22 molecular biology 10.64898/2026.04.20.719637 medRxiv
Top 0.1%
1.4%

Neurological, neurodegenerative, and psychiatric disorders impose substantial morbidity and disability worldwide, yet their molecular basis remains incompletely understood, in part due to limited access to human brain tissue. The Danish Brain Collection, comprising brains from individuals who lived in Danish psychiatric institutions from the 1940s to the 1980s, represents a unique but largely untapped resource for retrospective molecular investigation. Here, we assess the feasibility of extracting and sequencing DNA and RNA from decades-old FFPE brain tissue. We systematically evaluate how extraction and library preparation strategies influence nucleic acid yield and quality, and show that RNA end-repair prior to library preparation substantially enhances transcript diversity, improving data quality from highly degraded samples. Despite extensive fragmentation, we recover biologically informative transcriptomic profiles, including protein-coding and microRNA expression profiles that retain clear tissue specificity. These results establish the Danish Brain Collection as a viable resource for genomic and transcriptomic analyses and demonstrate the broader potential of archival FFPE tissues for large-scale molecular studies.

14
The Origin and Migration of the Ameru Community in Kenya based on mtDNA analysis.

Onyango, D. M.; Anampiu, R.; Ayieko, C.; Magonya, L. A.; Owuor, R. A.; Magaga, G. O.; Andika, B.

2026-04-18 evolutionary biology 10.64898/2026.04.16.718862 medRxiv
Top 0.1%
1.4%

Human diversity is not restricted to socio-cultural and linguistic domains but also extends deep into genetic roots. Africa harbors more genetic diversity than any other part of the world. Diversification of the African lineages was complex, involving long-distance gene flow. Data from Africans are needed to better understand the origin and evolution of modern humans, the genetic basis of local adaptation, and the evolution of complex traits and related diseases. This need formed the basis for this study, which set out to determine the origin and migration of the Ameru community in Kenya. Blood samples were collected from 132 male adults aged 65 years and above. DNA was extracted and analyzed for hypervariable regions 1 and 2. The regions were Sanger sequenced, and the sequences were aligned and analyzed using Geneious. Phylogenetic analysis was done using MEGA X, while haplotype analysis was done using DnaSP software. L1 haplogroup (2.9%) was found among Igembe (7%), Tharaka (6%), and Chuka (7%) and is common in West, Central, and parts of East Africa. L2 haplogroup (6.7%) was present in all subgroups except Imenti and Tigania, indicating West and Central African maternal ancestry. L1 and L2 haplotypes indicate that most Ameru subgroups share partial maternal ancestry from West and Central Africa, while Imenti and Tigania have different maternal lineages. L0-L4 haplogroups indicate predominant East, Central, and West African maternal origins, with subgroups showing variation in haplotype frequencies (e.g., L1 and L2 in Igembe, Tharaka, Chuka; L3 in Tharaka, Mwimbi, Chuka; L4 across all subgroups). Subgroup differences suggest that certain communities, particularly Imenti, have distinct maternal lineages, with less contribution from L1, L2, and L3 but potential links to Afro-Asiatic groups via L4 (found in the Middle East). Non-African haplogroups (N and R) point to historical interactions or shared ancestry with populations in Eurasia and the Horn of Africa, primarily in Tigania and Imenti. Overall, the Ameru maternal gene pool is heterogeneous, shaped by multiple migration routes and interactions across East Africa and beyond, with subgroup-specific maternal histories.

15
Biodesign Buddy: Integrating Generative Artificial Intelligence in Academic Biodesign

Riffle, D.; Rubery, P.

2026-03-13 scientific communication and education 10.64898/2026.03.11.710906 medRxiv
Top 0.1%
1.4%

Biodesign is an interdisciplinary research domain that incorporates principles from design and the life sciences to develop new systems, processes, and objects. Collegiate biodesign educators face unique pedagogical challenges, including an absence of relevant scholarship on curriculum design and instructional best practices for cultivating student scientific literacy. These difficulties may be overcome with newly available technologies, like generative AI systems, that enable personalized learning through domain-specific semantic spaces. This article examines the instructional value of one such domain-specific LLM, Biodesign Buddy, through a mixed-methods analysis of an eight-week study involving 64 students participating in an international biodesign competition. Results indicate strong support for integrating AI into biodesign coursework. Surveys captured attitudes toward AI, scientific literature, and learning experiences to assess AI's impact on learning outcomes. Findings suggest that integrating AI into biodesign pedagogy can meaningfully redress conceptual issues in biodesign while informing broader debates on AI's role in higher education. Impact Statement: This article introduces Biodesign Buddy, a domain-specific generative AI system for collegiate biodesign education, and reports on its exploratory deployment, offering design principles and preliminary findings to inform the development of AI-supported pedagogies for interdisciplinary biodesign instruction.

16
TaxonMatch: taxonomic integration and tree construction from heterogeneous biological databases

Leone, M.; Rech De Laval, V.; Drage, H. B.; Waterhouse, R. M.; Robinson-Rechavi, M.

2026-03-20 evolutionary biology 10.64898/2026.03.18.712418 medRxiv
Top 0.1%
1.3%

Integrating taxonomic data from various sources presents a significant challenge in biodiversity research, due to non-standardized nomenclature and evolving species classifications. Discrepancies between major repositories like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), as well as citizen science platforms such as iNaturalist, lead to fragmented and sometimes inaccurate biological data. We present TaxonMatch, a tool designed to address these challenges. TaxonMatch aligns taxonomic names, resolves synonymy, and corrects typographical and structural inconsistencies across databases. We show how it can be used to build a common backbone arthropod taxonomy over NCBI, GBIF and iNaturalist, to find the closest molecular data to a given fossil, and to identify IUCN endangered species with molecular data. TaxonMatch provides a cohesive taxonomic framework and a consistent taxonomic backbone, and can be applied to any taxonomic source. The tool is available at https://github.com/MoultDB/TaxonMatch.

17
A high-performance end-to-end 3D CLEM processing workflow for facilities

Roberge, H.; Woller, T.; Pavie, B.; Hennies, J.; de Heus, C.; Edakkandiyil, L.; Liv, N.; Munck, S.

2026-03-16 cell biology 10.64898/2026.03.13.711046 medRxiv
Top 0.1%
1.3%

Correlative Light and Electron Microscopy (CLEM) integrates the molecular specificity of light microscopy (LM) with the ultrastructural detail of electron microscopy (EM), enabling comprehensive spatial analysis of biological samples. Despite growing demand, processing 3D CLEM datasets remains challenging, specifically for service provision in facilities, due to their multimodal nature and the lack of unified approaches. Typical steps include EM slice alignment, LM-EM registration, segmentation, and 3D visualization. We present a modular, end-to-end pipeline that consolidates existing and newly developed tools into a coherent workflow for 3D CLEM analysis and allows railroading the approach. Designed as interoperable modules accessible through a user-friendly interface, the pipeline is fully open-source and scales from standard workstations to high-performance computing environments to address the need for analysis of growing datasets. While some steps still require manual input, individual components can be automated to increase throughput and reproducibility. Together, this integrated solution lowers technical barriers and supports broader adoption of 3D CLEM methodologies.

18
Development of an early warning system for Nipah outbreak prevention: on-site inactivation, PCR surveillance and sequencing in Bangladesh

Islam, M. N.; Khan, S. A.; Lanszki, Z.; Abraham, A.; Akter, S.; Bhuyan, A. A. M.; Zana, B.; Islam, M. S.; Zeghbib, S.; Leiner, K.; Jani, A. S. M. R.; Sarder, M. J. U.; Islam, M. H.; Debnath, N. C.; Uelmen, J. A.; Banyai, K.; Kemenesi, G.; Chowdhury, S.

2026-03-20 public and global health 10.64898/2026.03.17.26348576 medRxiv
Top 0.2%
1.3%

Background: Mobile laboratory diagnostic technologies for Nipah virus outbreak prevention, mitigation and response remain limited, despite the critical need for such capacities in remote, low-resource regions where most cases occur. We aim to address this gap by implementing a workflow that includes method development, laboratory validation, and field demonstration of a mobile Nipah virus complex diagnostic solution. Methods: We developed a flexible mobile laboratory workflow incorporating PCR capacity, a novel amplicon-based sequencing protocol, and a validated Nipah virus inactivation procedure. Following development and validation, we demonstrated the feasibility of this workflow through repeated field sampling of bat colonies in Nipah virus endemic regions of Bangladesh across multiple field campaigns. Findings: We demonstrated the feasibility of this system for early outbreak response and as a potential early warning tool prior to the emergence of human cases. We detected two urine samples from flying foxes that tested positive and performed full-scale on-site analysis, including qPCR diagnostics and NGS sequencing, within 24 hours. Interpretation: As highlighted in the present study, active surveillance enables outbreak prevention by identifying bat colonies that are actively shedding viruses in real time, even in rural settings. Also, this method can provide rapid, on-site sequence data to track and better understand the genomic diversity of Nipah virus in natural reservoirs during both outbreak and non-outbreak periods. In this study we aimed to establish the foundations of a standard procedure for safe and rapid field testing of Nipah virus in remote areas.

19
Assessment of accuracy of detection dog signaling behavior for the diagnosis of SARS-CoV-2 infection: A Canadian study

Mbutiwi, F. I. N.; Otis, C.; Schiller, I.; LaChance, M.; Martin, L.; Jammal, A.; Odita, A.; Agbaje, N.; Khatib, A.; Dendukuri, N.; Tamim, H.; Troncy, E.; Carabin, H.

2026-03-10 public and global health 10.64898/2026.03.04.26347154 medRxiv
Top 0.2%
1.1%

Background: Dogs trained for metabolomics detection can identify pathological changes through their refined sense of smell. During the COVID-19 pandemic, studies worldwide evaluated Detection Dog Signaling Behavior (DDSB) for SARS-CoV-2. However, most statistical approaches failed to account for key sources of bias, potentially distorting performance estimates. This study aimed to estimate DDSB accuracy for SARS-CoV-2 infection in a Canadian population while assessing the impact of selected sources of bias on performance estimates. Methods: Participants attending the COVID-19 assessment clinic at St. Joseph's Health Centre, Toronto, were recruited between October and December 2021. Each provided a nasopharyngeal swab for reverse transcription-polymerase chain reaction (RT-PCR) testing and three sweat samples for canine detection. Three dogs were trained to detect SARS-CoV-2 in sweat samples. Validation sessions were video recorded and independently reviewed by two blinded observers. DDSB diagnostic accuracy was estimated against RT-PCR, evaluating the impact of ignoring its imperfect accuracy and repeated sniffing of the same samples during validation. Results: Among 2,358 participants (mean age: 34.7 years; 55.7% female), 437 contributed to training. Validation tests included 146 unique participants (25 RT-PCR positive, 121 negative). Assuming RT-PCR was imperfect, DDSB posterior median sensitivity ranged from 67% (95% credible interval [CrI]: 29%-97%) to 78% (95% CrI: 41%-99%), and specificity from 67% (95% CrI: 53%-79%) to 77% (95% CrI: 65%-87%) across the three dogs. Assuming RT-PCR was perfect, sensitivity decreased by 7.9% to 9%, while specificity remained unchanged. Including repeated positive samples without adjustment did not affect specificity estimates but overestimated sensitivity by 7.9% to 13.4% (imperfect RT-PCR) and 11.4% to 18.3% (perfect RT-PCR). Conclusion: DDSB shows potential as a non-invasive screening tool for SARS-CoV-2 infection. Our results highlight the challenges of designing such studies and the need for standardized training and validation procedures to ensure the reliability and validity of DDSB.

20
Accessible and Reproducible Renal Cell Carcinoma Research Through Open-Sourcing Data and Annotations

de Boer, S.; Häntze, H.; Ziegelmayer, S.; van Ginneken, B.; Prokop, M.; Bressem, K. K.; Hering, A.

2026-04-23 radiology and imaging 10.64898/2026.04.22.26351451 medRxiv
Top 0.2%
1.1%

Background: Medical imaging, especially computed tomography and magnetic resonance imaging, is essential in clinical care of patients with renal cell carcinoma (RCC). Artificial intelligence (AI) research into computer-aided diagnosis, staging and treatment planning needs curated and annotated datasets. Across the literature, The Cancer Genome Atlas (TCGA) datasets are widely used for model training and validation. However, re-annotation is often necessary due to limited access to public annotations, raising entry barriers and hindering comparison with prior work. Methods: We screened 1915 CT scans from three TCGA-RCC databases and employed a segmentation model to annotate kidney lesions. After a metadata-based exclusion step, we hosted a reader study with all papillary (n=56), chromophobe (n=27) and 200 randomly selected clear cell RCC cases. Two students quality-checked and corrected the data and annotated tumors and cysts. Uncertain cases were checked by a board-certified radiologist. Results: After data exclusion and quality control, a total of 142 annotated CT scans from 101 patients (26 female, 75 male, mean age 56 years) remained. This includes 95 CTs with clear cell RCC, 29 with papillary RCC and 18 with chromophobe RCC. Images and voxel-level annotations of kidneys and lesions are open-sourced at https://zenodo.org/records/19630298. Conclusion: By making the annotations open-source, we encourage accessible and reproducible AI research for renal cell carcinoma. We invite other researchers who have previously annotated any of these cohorts to share their annotations.